I want to thank my mentors for their valuable inputs to this project:
Dr. Blanca Himes, Associate Professor, Biostatistics, Epidemiology and Informatics; for helping me come up with the objectives of my project and timely feedback,
Dr. Jesse Hsu, Assistant Professor, Biostatistics, Epidemiology and Informatics; for guiding me through the descriptive statistics and data presentation,
Sherrie Xie, VMD-PhD Candidate, Epidemiology; for helping me with the GIS maps.
My project aims to summarize and visualize the trends in the use of animals in biomedical research by species, pain levels and state in the US the over the period from 2008 to 2019. The data is obtained from the US Department of Agriculture’s APHIS annual reports. Following this exploratory analysis, a prediction for the number of animals that will be used over the next 5 years will be made.
The use of animals in biomedical research is ubiquitous. There are two main categories for animal use: basic biomedical sciences research and (mimicking a human disease in an animal model and studying gene expression, molecular mechanisms, etc.) and drug testing (toxicity and efficacy of an experimental drug). A growing body of studies and voices from the scientific community have pointed out the poor reliability and predictive value of animal models for human outcomes and for understanding human physiology. (Akhtar 2015) In 2004, the FDA estimated that 92% of drugs that pass preclinical tests in animal models fail in human clinical trials. (A. 2004) More recent analysis suggests that, in spite of efforts to improve the predictability of animal testing, the failure rate has increased to 96%. (Pippin 2012)
In 1938, Congress passed the U.S. Federal Food, Drug, and Cosmetic Act, mandating animal toxicity testing. As of October 7, 2021, Congress introduced the bipartisan FDA Modernization Act to end animal testing mandates. This comes after the European Parliament resoundingly passed a resolution on September 16, 2021, with a vote of 667 to 4 to phase out animal testing. How has animal use in research changed over the past few years in the US? What is the significance of species, state, pain levels? Using the limited USDA data, I aim to answer these questions. Furthermore, a regression model will predict the number of animals that will be needed in the next few years if the same trends continue. This analysis could be of interest to biomedical companies to reduce their time and resources spent on animal models, FDA for revision of regulatory requirements, bioethics specialists and animal advocacy groups.
The trends in animal use could be influenced by technologies that enhanced the ease of building animal models, technologies that performed better than animal models, change in bioethical standards, change in public sentiment about the topic and even opinions voiced by the heads of government regulatory bodies. I am personally interested in this analysis because one of my career goals is to work towards developing and commercializing alternatives to animal methods in biomedical research.
options(scipen = 999)
knitr::opts_chunk$set(echo = TRUE)
library("tidyverse")
library("dplyr")
library("ggplot2")
library("gridExtra")
library("sf")
library("spData")
library("mapview")
library("leaflet")
USDA publishes an annual report of the number of animals used by state, species and pain category. The report for every year is published in January two years later. Hence the latest data I could obtain was from 2019. The reports were published starting 2008, hence I have data for a 12 year period.
The data is in the form of several PDFs. I cleaned the data manually into .txt format. The nomenclature is year_col_X where X is the pain category. The pain categories are:
Column B: Animals held by a facility but not used in any research that year
Column C: Animals used in research; no pain involved; no pain drugs administered
Column D: Animals used in research; pain involved; pain drugs administered
Column E: Animals used in research; pain involved; no pain drugs administered.
The following code chunk loads the data as a dataframe, stacked by year and column. The loaded data has 53 rows per year per column. This includes the 50 states, District of Columbia, Puerto Rico, and “REPORT TOTAL” which is the sum of all columns (species). Note that the species column is not an exhaustive list of all animals used in biomedical research but only those covered by the Animal Welfare Act. The AWA does not cover rats, mice, birds and reptiles, which happen to be over 95% of all animals used. Columns B, C, D and E are assigned values 0, 1, 2 and 3 to reflect pain levels in a more intuitive way.
#Loading the data using nested functions
datafr <- purrr::map_dfr(
.x = c(2008:2019),
.f = function(x){
purrr::map_dfr(
.x = c("B", "C", "D", "E"),
.f = function(x, y){
dat <- read.table(file = paste0("C:/Users/Deeksha Hegde/Downloads/BMIN503_Final_Project/USDA_Data/",y,"_col_",x,".txt"), header = TRUE, sep = "\t")
dat %>%
mutate(across(.cols = All_Other_Covered_Species:Total, .fns =~ as.numeric(str_remove_all(string = .x, ","))),
Year = y,
Column = x
)
}, y=x
)
}
)
#Checking if the data loaded is correct
datafr %>% count(Year, Column)
## Year Column n
## 1 2008 B 53
## 2 2008 C 53
## 3 2008 D 53
## 4 2008 E 53
## 5 2009 B 53
## 6 2009 C 53
## 7 2009 D 53
## 8 2009 E 53
## 9 2010 B 53
## 10 2010 C 53
## 11 2010 D 53
## 12 2010 E 53
## 13 2011 B 53
## 14 2011 C 53
## 15 2011 D 53
## 16 2011 E 53
## 17 2012 B 53
## 18 2012 C 53
## 19 2012 D 53
## 20 2012 E 53
## 21 2013 B 53
## 22 2013 C 53
## 23 2013 D 53
## 24 2013 E 53
## 25 2014 B 53
## 26 2014 C 53
## 27 2014 D 53
## 28 2014 E 53
## 29 2015 B 53
## 30 2015 C 53
## 31 2015 D 53
## 32 2015 E 53
## 33 2016 B 53
## 34 2016 C 53
## 35 2016 D 53
## 36 2016 E 53
## 37 2017 B 53
## 38 2017 C 53
## 39 2017 D 53
## 40 2017 E 53
## 41 2018 B 53
## 42 2018 C 53
## 43 2018 D 53
## 44 2018 E 53
## 45 2019 B 53
## 46 2019 C 53
## 47 2019 D 53
## 48 2019 E 53
datafr <- mutate(datafr, pain.level = factor(Column, levels = c("B", "C", "D", "E"), labels = c(0, 1, 2, 3)))
head(datafr)
## State All_Other_Covered_Species Cats Dogs Guinea_Pigs Hamsters Marine.Mammals
## 1 AK 426 0 6 0 190 0
## 2 AL 132 34 110 0 0 0
## 3 AR 92 69 22 15 0 0
## 4 AZ 144 8 26 0 0 0
## 5 CA 1870 409 118 735 60 0
## 6 CO 358 253 4 423 103 0
## Nonhuman_Primates Other_Farm_Animals Pig Rabbits Sheep Total Year Column
## 1 0 0 0 0 0 622 2008 B
## 2 0 612 4 321 6 1219 2008 B
## 3 34 0 0 0 0 232 2008 B
## 4 38 1 12 32 3 264 2008 B
## 5 6009 4473 45 4438 179 18336 2008 B
## 6 0 0 0 0 0 1141 2008 B
## pain.level
## 1 0
## 2 0
## 3 0
## 4 0
## 5 0
## 6 0
This project is divided into two main parts: descriptive statistics and prediction model. In the first part, I performed an exploratory analysis of the data, visualizing by year, state, and species. In the second part, a model to predict the numbers for the next 5 years was created.
Now let us visualize the numbers!
The first plot shows the total animals (all categories combined) by year. The second plot shows the same but with the breakup by column.
data.total <- filter(datafr, State == "REPORT TOTAL")
#Total by year
data.total1 <- summarise(group_by(data.total, Year), Total = sum(Total))
total.plot1 <- ggplot(data = data.total1, aes(x = Year, y = Total)) + geom_line(size = 1) + scale_x_continuous(breaks = c(2008:2019)) + scale_y_continuous(breaks = seq(900000, 1300000, by = 100000), labels = seq(9, 13, 1)) + xlab(label = "Year") + ylab(label = "Total in hundred thousands") + ggtitle("Total number of animals vs. Year") +
theme(
plot.title = element_text(size = 10, face = "bold", hjust = 0.5),
axis.title = element_text(size = 10),
axis.text = element_text(size = 10),
legend.position = "bottom",
legend.direction = "horizontal",
legend.title = element_text(size = 10),
legend.text = element_text(size = 10)
)
#Total by year and column
total.plot2 <- ggplot(data = data.total, aes(x = Year, y = Total, color = pain.level)) + geom_line(size = 1) + scale_x_continuous(breaks = c(2008:2019)) + scale_y_continuous(breaks = seq(100000, 700000, by = 100000), labels = seq(1, 7, 1)) + xlab(label = "Year") + ylab(label = "Total in hundred thousands") + ggtitle("Total number of animals vs. Year by Pain Levels") +
theme(
plot.title = element_text(size = 10, face = "bold", hjust = 0.5),
axis.title = element_text(size = 10),
axis.text = element_text(size = 10),
legend.position = "bottom",
legend.direction = "horizontal",
legend.title = element_text(size = 10),
legend.text = element_text(size = 10)
)
grid.arrange(total.plot1, total.plot2, ncol = 2)
We see that there is a huge rise in the total from 2009 to 2010, and a steady decline since then. Overall, there is a significant decline over the 12 year period.
From the second plot, we see that pain category 1, which involves no pain and no pain drugs, is the highest in magnitude throughout the study period. We also see that the decline in the total animals is mainly due to the decline in animals at pain level 1.
Now we will examine the total numbers by state.
#Total by column by state
data.by.state <- datafr %>%
filter(State != "REPORT TOTAL") %>%
group_by(State, pain.level) %>%
summarise(Total = sum(Total))
data.by.state.total <- group_by(data.by.state, State) %>%
summarize(Total.all = sum(Total))
state.plot.1 <- ggplot(data.by.state.total, aes(x = State, y = Total.all)) + geom_bar(stat = "identity") + ggtitle("Total number of animals used by state") + ylab(label = "Total across all pain levels in hundred thousands") + scale_y_continuous(breaks = seq(100000, 1300000, by = 100000), labels = seq(1, 13, 1)) +
theme(
plot.title = element_text(size = 14, face = "bold", hjust = 0.5),
axis.title = element_text(size = 10),
axis.text.x = element_text(size = 12, angle = 90, vjust = 0.5, hjust = 0.5),
axis.text.y = element_text(size = 10),
legend.position = "bottom",
legend.direction = "horizontal",
legend.title = element_text(size = 10),
legend.text = element_text(size = 10)
)
state.plot.2 <- ggplot(data.by.state, aes(x = State, y = Total, color = pain.level)) + geom_point(size = 4) + ggtitle("Total number of animals used by state by pain level") + scale_y_continuous(breaks = seq(100000, 700000, by = 100000), labels = seq(1, 7, 1)) + ylab(label = "Total in hundred thousands") +
theme(
plot.title = element_text(size = 14, face = "bold", hjust = 0.5),
axis.title = element_text(size = 10),
axis.text.x = element_text(size = 12, angle = 90, vjust = 0.5, hjust = 0.5),
axis.text.y = element_text(size = 10),
legend.position = "bottom",
legend.direction = "horizontal",
legend.title = element_text(size = 10),
legend.text = element_text(size = 10)
)
state.plot.1
state.plot.2
From the graphs, we get the following information:
States using the highest number of animals overall across all pain levels: CA, MA, NJ
States using the highest number of animals by pain level:
0 CA, MD, TX
1 CA, MA, OH
2 CA, TX, MA
3 MO, MI, IA
An arguably better representation of the same data is the use of interactive maps.
usa <- st_read("C:/Users/Deeksha Hegde/Downloads/us-state-boundaries/us-state-boundaries.shp") %>%
rename(State = stusab)
## Reading layer `us-state-boundaries' from data source
## `C:\Users\Deeksha Hegde\Downloads\us-state-boundaries\us-state-boundaries.shp'
## using driver `ESRI Shapefile'
## Simple feature collection with 56 features and 20 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -179.2311 ymin: -14.60181 xmax: 179.8597 ymax: 71.44069
## Geodetic CRS: WGS 84
head(usa)
## Simple feature collection with 6 features and 20 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -124.8485 ymin: 14.03656 xmax: 146.1544 ymax: 49.00241
## Geodetic CRS: WGS 84
## gid arealand division intptlat
## 1 16 278176477 0 18.21765
## 2 23 472276664 0 14.93678
## 3 31 1627312771 7 34.89553
## 4 35 2136109036 5 38.64729
## 5 40 -1616974352 1 41.59742
## 6 54 312831514 9 47.40732
## name objectid areawater intptlon
## 1 Puerto Rico 50 628200285 -66.41080
## 2 Commonwealth of the Northern Mariana Islands 36 349301029 145.60102
## 3 Arkansas 44 -1334552525 -92.44463
## 4 West Virginia 1 489848791 -80.61833
## 5 Rhode Island 6 1323457457 -71.52727
## 6 Washington 20 -324557627 -120.57580
## oid funcstat centlon State state statens centlat
## 1 303146031 A -66.41476 PR 72 01779808 18.21647
## 2 -1625647860 A 145.59687 MP 69 01779809 16.79744
## 3 266078934 A -92.44270 AR 05 00068085 34.89402
## 4 -1929409300 A -80.62346 WV 54 01779805 38.64119
## 5 -1861167639 A -71.52472 RI 44 01219835 41.59403
## 6 -1859906639 A -120.59557 WA 53 01779804 47.41490
## basename mtfcc region lsadc geoid
## 1 Puerto Rico G4000 9 00 72
## 2 Commonwealth of the Northern Mariana Islands G4000 9 00 69
## 3 Arkansas G4000 3 00 05
## 4 West Virginia G4000 3 00 54
## 5 Rhode Island G4000 1 00 44
## 6 Washington G4000 4 00 53
## geometry
## 1 MULTIPOLYGON (((-67.20794 1...
## 2 MULTIPOLYGON (((145.5726 15...
## 3 MULTIPOLYGON (((-94.55218 3...
## 4 MULTIPOLYGON (((-81.74725 3...
## 5 MULTIPOLYGON (((-71.7897 41...
## 6 MULTIPOLYGON (((-123.2479 4...
to_map <- inner_join(usa, data.by.state.total, by = "State")
pal_fun <- colorNumeric("Blues", NULL) #Pick a different pal, not starting from white
pu_message <- paste0(to_map$State,
"<br>Number of animals used: ",
prettyNum(to_map$Total.all, big.mark = ","))
leaflet(to_map) %>%
addPolygons(stroke = FALSE,
fillColor = ~pal_fun(Total.all),
fillOpacity = 0.8, smoothFactor = 0.5,
popup = pu_message) %>%
addProviderTiles(providers$CartoDB.Positron) %>%
addLegend("bottomright",
pal=pal_fun,
values=~Total.all,
title = 'Total number of animals used',
opacity = 1) %>%
addScaleBar()
Now, since we’re in Pennsylvania, let us see what the plots for PA look like.
#Total for PA
data.PA <- filter(datafr, State == "PA")
PA.plot.1 <- ggplot(data = summarise(group_by(data.PA, Year), Total = sum(Total)), aes(x = Year, y = Total)) + geom_line(size = 1) + scale_x_continuous(breaks = c(2008:2019)) + ggtitle("Total number of animals vs. Year in Pennsylvania") + theme(
plot.title = element_text(size = 10, face = "bold", hjust = 0.5),
axis.title = element_text(size = 10),
axis.text = element_text(size = 10),
legend.position = "bottom",
legend.direction = "horizontal",
legend.title = element_text(size = 10),
legend.text = element_text(size = 10)
)
#Total by column for PA
PA.plot.2 <- ggplot(data = data.PA, aes(x = Year, y = Total, color = pain.level)) + geom_line(size = 1) +
scale_x_continuous(breaks = c(2008:2019)) + ggtitle("Total number of animals vs. Year by Pain Levels in Pennsylvania") + theme(
plot.title = element_text(size = 10, face = "bold", hjust = 0.5),
axis.title = element_text(size = 10),
axis.text = element_text(size = 10),
legend.position = "bottom",
legend.direction = "horizontal",
legend.title = element_text(size = 10),
legend.text = element_text(size = 10)
)
grid.arrange(PA.plot.1, PA.plot.2, ncol = 2)
We see a huge decline from 2009 to 2015.
Next, I want to find out which states contributed the most to the decline observed in the first plot. I calculated the mean and standard deviation over the years for each state and sorted them in decreasing order of SD. Noticing that Arizona seems to have a higher SD than mean, which raised suspicion, I calculated the coefficient of variability (CV) and sorted in the decreasing order of CV.
#Std dev of total for all states
data.by.state.year <- datafr %>%
filter(State != "REPORT TOTAL") %>%
group_by(State, Year) %>%
summarise(Total = sum(Total))
state.std.dev <- summarise(data.by.state.year, mean = mean(Total), sd = sd(Total))
state.std.dev[order(state.std.dev$sd, decreasing = TRUE),]
## # A tibble: 52 x 3
## State mean sd
## <chr> <dbl> <dbl>
## 1 AZ 16261. 35984.
## 2 NY 45696. 25676.
## 3 CA 103133. 22675.
## 4 KS 11863. 20574.
## 5 PA 50591. 20247.
## 6 OH 64393. 18341.
## 7 IA 30172 15683.
## 8 NJ 67210. 13946.
## 9 MO 36277. 13905.
## 10 MD 58124. 13290.
## # ... with 42 more rows
state.std.dev <- state.std.dev %>% mutate(cv = sd/mean) %>%
arrange(desc(cv))
state.std.dev
## # A tibble: 52 x 4
## State mean sd cv
## <chr> <dbl> <dbl> <dbl>
## 1 AZ 16261. 35984. 2.21
## 2 KS 11863. 20574. 1.73
## 3 WV 2116. 2936. 1.39
## 4 NE 4406. 3886. 0.882
## 5 AK 636. 468. 0.735
## 6 HI 337 227. 0.675
## 7 DC 7995. 5279. 0.660
## 8 WY 543. 307. 0.565
## 9 NY 45696. 25676. 0.562
## 10 IN 17136. 9132. 0.533
## # ... with 42 more rows
The states having CV > 1 will be examined further.
#Total by column for AZ
data.AZ <- filter(datafr, State == "AZ")
ggplot(data = data.AZ, aes(x = Year, y = Total, color = pain.level)) + geom_line(size = 1)+
scale_x_continuous(breaks = c(2008:2019)) + ggtitle("Total number of animals vs. Year by Pain Levels in Arizona")
Column C (pain level 1) of the year 2010 is clearly an outlier. I verified the data again from the USDA report to check if there was an error in data loading. It is most likely a typographical error in the USDA report. On examination, I found that the outlier is coming from the All_Other_Covered_Species column. I first replaced this value with NA and found the mean along the column for the rest of the years. I replaced the NA with the rounded mean and finally updated the total. The updated plot for AZ is also shown.
#Setting the outlier point to NA
data.AZ[, 2][data.AZ['Column'] == "C" & data.AZ['Year'] == 2010] <- NA
#Setting the outlier point to the mean of the column over the years excluding outlier year
data.AZ[, 2][data.AZ['Column'] == "C" & data.AZ['Year'] == 2010] <- round(mean(data.AZ[, 2][data.AZ['Column'] == "C"], na.rm = TRUE), digits = 0)
#Updating the Total column for the year
data.AZ[, 'Total'][data.AZ['Column'] == "C" & data.AZ['Year'] == 2010] <- sum(data.AZ[2:12][data.AZ['Column'] == "C" & data.AZ['Year'] == 2010])
#Updated plot with outlier adjusted
ggplot(data = data.AZ, aes(x = Year, y = Total, color = pain.level)) + geom_line(size = 1)+
scale_x_continuous(breaks = c(2008:2019)) + ggtitle("Total number of animals vs. Year by Pain Levels in Arizona (Outlier adjusted)")
The second state having CV > 1 is Kansas. Let’s look at its plot.
#Total by column for KS
data.KS <- filter(datafr, State == "KS")
ggplot(data = data.KS, aes(x = Year, y = Total, color = pain.level)) + geom_line(size = 1)+
scale_x_continuous(breaks = c(2008:2019)) + ggtitle("Total number of animals vs. Year by Pain Levels in Kansas")
2019 pain level 1 is likely to be an outlier but unlike the case with Arizona, we cannot be sure since we don’t have the data for the years after 2019. I will not alter this point but exclude it in some analysis.
Finally, we have West Virginia.
#Total by column for WV
data.WV <- filter(datafr, State == "WV")
ggplot(data = data.WV, aes(x = Year, y = Total, color = pain.level)) + geom_line(size = 1)+
scale_x_continuous(breaks = c(2008:2019)) + ggtitle("Total number of animals vs. Year by Pain Levels in West Virginia")
WV does not seem to have outliers, since there is an increase and decrease in all columns over 4 years from 2015-2019.
Now that we have investigated outliers, let us go back to the Total number of animals vs. Year plots and revise them.
#Updating dataframe and total plots
datafr[datafr['State'] == "AZ" & datafr['Column'] == "C" & datafr['Year'] == 2010, ] <- data.AZ[data.AZ['Column'] == "C" & data.AZ['Year'] == 2010, ]
datafr[530, 2:13] <- colSums(datafr[478:529, 2:13])
data.total <- filter(datafr, State == "REPORT TOTAL")
#Total by year
data.total1 <- summarise(group_by(data.total, Year), Total = sum(Total))
total.plot.1.updated <- ggplot(data = data.total1, aes(x = Year, y = Total)) + geom_line(size = 1) + scale_x_continuous(breaks = c(2008:2019)) + scale_y_continuous(breaks = seq(900000, 1300000, by = 100000), labels = seq(9, 13, 1)) + xlab(label = "Year") + ylab(label = "Total in hundred thousands") + ggtitle("Total number of animals vs. Year") +
theme(
plot.title = element_text(size = 10, face = "bold", hjust = 0.5),
axis.title = element_text(size = 10),
axis.text = element_text(size = 10),
legend.position = "bottom",
legend.direction = "horizontal",
legend.title = element_text(size = 10),
legend.text = element_text(size = 10)
)
#Total by year and column
total.plot.2.updated <- ggplot(data = data.total, aes(x = Year, y = Total, color = pain.level)) + geom_line(size = 1) + scale_x_continuous(breaks = c(2008:2019)) + scale_y_continuous(breaks = seq(100000, 700000, by = 100000), labels = seq(1, 7, 1)) + xlab(label = "Year") + ylab(label = "Total in hundred thousands") + ggtitle("Total number of animals vs. Year by Pain Levels") +
theme(
plot.title = element_text(size = 10, face = "bold", hjust = 0.5),
axis.title = element_text(size = 10),
axis.text = element_text(size = 10),
legend.position = "bottom",
legend.direction = "horizontal",
legend.title = element_text(size = 10),
legend.text = element_text(size = 10)
)
grid.arrange(total.plot1, total.plot2, total.plot.1.updated, total.plot.2.updated, ncol = 2, left = "Before and after outlier adjustment")
Now we observe a more steady decline in the total number of animals over the years.
Next, let us look at a heat map of the percentage change by state over the 12 years. Kansas was excluded from this (set to NA) because it is a potential outlier that interefers with the data presentation done here.
data.state.percentage <- data.by.state.year[data.by.state.year['Year'] == 2008 | data.by.state.year['Year'] == 2019, ] %>%
group_by(State) %>%
summarize(percent = 100*(Total - first(Total))/first(Total)) %>%
filter(percent != 0)
data.state.percentage[data.state.percentage$State == "KS", "percent"] <- NA
data.state.percentage
## # A tibble: 52 x 2
## # Groups: State [52]
## State percent
## <chr> <dbl>
## 1 AK -76.1
## 2 AL -18.3
## 3 AR 36.5
## 4 AZ -20.4
## 5 CA -40.6
## 6 CO -16.0
## 7 CT -58.6
## 8 DC -28.1
## 9 DE -60.5
## 10 FL 11.4
## # ... with 42 more rows
to_map1 <- inner_join(usa, data.state.percentage, by = "State")
#Gradient for -ve and +ve
#KS taken as NA because cannot conclude as outlier but very high value
minval <- min(data.state.percentage$percent, na.rm = TRUE)
maxval <- max(data.state.percentage$percent, na.rm = TRUE)
domain <- c(minval, maxval)
colorPal <- c(colorRampPalette(colors = c("red", "yellow"), space = "Lab")(abs(minval)), colorRampPalette(colors = c("yellow", "green"), space = "Lab")(maxval))
##b2182b, #2166ac
pal_fun <- colorNumeric("BuPu", NULL)
pu_message <- paste0(to_map1$State,
"<br>Percentage change in total animals used from 2008 to 2019 (%): ",
round(to_map1$percent, digits = 0))
leaflet(to_map1) %>%
addPolygons(stroke = FALSE,
color = 'white',
fillOpacity = 0.8, smoothFactor = 0.5,
fillColor = ~get('colorBin')(colorPal, domain)(percent),
popup = pu_message) %>%
addProviderTiles(providers$CartoDB.Positron) %>%
addLegend("bottomright",
pal=colorBin(colorPal, domain = domain + 1),
values=~domain,
title = 'Percentage change in total animals used from 2008 to 2019 by state (%)',
opacity = 1) %>%
addScaleBar()
Next, I want to visualize the patterns in each of the states. The following chunk plots the data for each of the states over the 12 year period.
ggplot(data = datafr %>% filter(State != "REPORT TOTAL"), aes(x = Year, y = Total, color = pain.level)) + geom_line(size = 2.5) + ggtitle("Total number of animals used by each state by pain levels") +
scale_x_continuous(breaks = c(2008:2019)) + facet_wrap( ~ State, ncol = 5, scales = "free") + theme(
strip.text.x = element_text(size = 40, color = "blue", face = "bold"),
plot.title = element_text(size = 50, face = "bold", hjust = 0.5),
axis.title = element_blank(),
axis.text = element_text(size = 8),
legend.position = "bottom",
legend.justification = "right",
legend.title = element_text(size = 50),
legend.text = element_text(size = 50),
legend.key.width = unit(x = 3, units = "in")
)
From this grid, we see that out of the top 10 states using the highest number of animals, 6 of them saw a decreasing trend in the study period.
To visualize the numbers by species, I first performed the same standard deviation analysis as before for species. The following chunks plot the mean and standard deviation over the years by species for the top 3 states, California, Massachusetts and New Jersey.
#Std dev of species for top states
data.CA <- filter(datafr, State == "CA") %>%
filter(State != "REPORT TOTAL") %>%
select(-State) %>%
group_by(Year) %>%
summarise(across(.cols = All_Other_Covered_Species:Total, .fns = sum, na.rm = TRUE)) %>%
ungroup() %>%
summarise(across(.cols = All_Other_Covered_Species:Total, .fns = sd, na.rm = TRUE)) %>%
pivot_longer(cols = All_Other_Covered_Species:Total, names_to = "species", values_to = "sd") %>%
arrange(desc(sd)) %>% full_join(
filter(datafr, State == "CA") %>%
filter(State != "REPORT TOTAL") %>%
select(-State) %>%
group_by(Year) %>%
summarise(across(.cols = All_Other_Covered_Species:Total, .fns = sum, na.rm = TRUE)) %>%
ungroup() %>%
summarise(across(.cols = All_Other_Covered_Species:Total, .fns = ~mean(x = .x, na.rm = TRUE))) %>%
pivot_longer(cols = All_Other_Covered_Species:Total, names_to = "species", values_to = "mean"), by = "species") %>%
arrange(sd)
CA.species <- ggplot(data.CA %>% mutate(species = factor(species, levels = data.CA$species)), aes(x = species, y = sd)) + geom_bar(stat="identity", fill="steelblue", width=0.5) + geom_bar(mapping = aes(y = mean), stat="identity", fill="steelblue", width=0.5, alpha = 0.5, position = "dodge") + ggtitle("Mean and Standard deviation of species over 12 years in CA") + ylab("Mean (light blue) and SD (dark blue)") + xlab("Species") +
theme(
plot.title = element_text(size = 10, face = "bold", hjust = 0.5),
axis.title = element_text(size = 10),
axis.text.x = element_text(size = 8, angle = 90, vjust = 0.5, hjust = 0.5),
legend.position = "bottom",
legend.direction = "horizontal",
legend.title = element_text(size = 10),
legend.text = element_text(size = 10)
)
#Std dev of species for top states - MA
data.MA <- filter(datafr, State == "MA") %>%
filter(State != "REPORT TOTAL") %>%
select(-State) %>%
group_by(Year) %>%
summarise(across(.cols = All_Other_Covered_Species:Total, .fns = sum, na.rm = TRUE)) %>%
ungroup() %>%
summarise(across(.cols = All_Other_Covered_Species:Total, .fns = sd, na.rm = TRUE)) %>%
pivot_longer(cols = All_Other_Covered_Species:Total, names_to = "species", values_to = "sd") %>%
arrange(desc(sd)) %>% full_join(
filter(datafr, State == "MA") %>%
filter(State != "REPORT TOTAL") %>%
select(-State) %>%
group_by(Year) %>%
summarise(across(.cols = All_Other_Covered_Species:Total, .fns = sum, na.rm = TRUE)) %>%
ungroup() %>%
summarise(across(.cols = All_Other_Covered_Species:Total, .fns = ~mean(x = .x, na.rm = TRUE))) %>%
pivot_longer(cols = All_Other_Covered_Species:Total, names_to = "species", values_to = "mean"), by = "species") %>%
arrange(sd)
MA.species <- ggplot(data.MA %>% mutate(species = factor(species, levels = data.MA$species)), aes(x = species, y = sd)) + geom_bar(stat="identity", fill="steelblue", width=0.5) + geom_bar(mapping = aes(y = mean), stat="identity", fill="steelblue", width=0.5, alpha = 0.5, position = "dodge") + ggtitle("Mean and Standard deviation of species over 12 years in MA") + ylab("Mean (light blue) and SD (dark blue)") + xlab("Species") +
theme(
plot.title = element_text(size = 10, face = "bold", hjust = 0.5),
axis.title = element_text(size = 10),
axis.text.x = element_text(size = 8, angle = 90, vjust = 0.5, hjust = 0.5),
legend.position = "bottom",
legend.direction = "horizontal",
legend.title = element_text(size = 10),
legend.text = element_text(size = 10)
)
#Std dev of species for top states - NJ
data.NJ <- filter(datafr, State == "NJ") %>%
filter(State != "REPORT TOTAL") %>%
select(-State) %>%
group_by(Year) %>%
summarise(across(.cols = All_Other_Covered_Species:Total, .fns = sum, na.rm = TRUE)) %>%
ungroup() %>%
summarise(across(.cols = All_Other_Covered_Species:Total, .fns = sd, na.rm = TRUE)) %>%
pivot_longer(cols = All_Other_Covered_Species:Total, names_to = "species", values_to = "sd") %>%
arrange(desc(sd)) %>% full_join(
filter(datafr, State == "NJ") %>%
filter(State != "REPORT TOTAL") %>%
select(-State) %>%
group_by(Year) %>%
summarise(across(.cols = All_Other_Covered_Species:Total, .fns = sum, na.rm = TRUE)) %>%
ungroup() %>%
summarise(across(.cols = All_Other_Covered_Species:Total, .fns = ~mean(x = .x, na.rm = TRUE))) %>%
pivot_longer(cols = All_Other_Covered_Species:Total, names_to = "species", values_to = "mean"), by = "species") %>%
arrange(sd)
NJ.species <- ggplot(data.NJ %>% mutate(species = factor(species, levels = data.NJ$species)), aes(x = species, y = sd)) + geom_bar(stat="identity", fill="steelblue", width=0.5) + geom_bar(mapping = aes(y = mean), stat="identity", fill="steelblue", width=0.5, alpha = 0.5, position = "dodge") + ggtitle("Mean and Standard deviation of species over 12 years in NJ") + ylab("Mean (light blue) and SD (dark blue)") + xlab("Species") +
theme(
plot.title = element_text(size = 10, face = "bold", hjust = 0.5),
axis.title = element_text(size = 10),
axis.text.x = element_text(size = 8, angle = 90, vjust = 0.5, hjust = 0.5),
legend.position = "bottom",
legend.direction = "horizontal",
legend.title = element_text(size = 10),
legend.text = element_text(size = 10)
)
grid.arrange(CA.species, MA.species, NJ.species, ncol = 1)
These plots show us the variation in each of the species by state. We see that in CA, rabbits are the most commonly used species and they have a high standard deviation. Relative to their numbers, farm animals and guinea pigs also showed high variation. In MA, hamsters showed the highest variation while guinea pigs were the highest used by number. In NJ, hamsters were the top species ad also showed high standard deviation. Relative to their numbers, non-human primates also showed high variation in NJ.
Next I want to visualize the patterns for each of the species over the 12 year period. The following chunk plots the same in one grid.
data.species.pl <- data.total %>%
group_by(pain.level) %>%
summarise(across(.cols = All_Other_Covered_Species:Total, .fns = sum, na.rm = TRUE))
data.species.year <- data.total %>%
group_by(Year) %>%
summarise(across(.cols = All_Other_Covered_Species:Total, .fns = sum, na.rm = TRUE)) %>%
pivot_longer(cols = All_Other_Covered_Species:Total, names_to = "species", values_to = "number")
#Plot grid for each species
ggplot(data = data.species.year, aes(x = Year, y = number)) + geom_line(size = 1) + ggtitle("Number of animals by species between 2008-2019") +
scale_x_continuous(breaks = c(2008:2019)) + facet_wrap( ~ species, ncol = 4, scales = "free") + theme(
strip.text.x = element_text(size = 8, color = "blue", face = "bold"),
plot.title = element_text(size = 15, face = "bold", hjust = 0.5),
axis.title = element_blank(),
axis.text = element_text(size = 7),
legend.position = "bottom",
legend.justification = "right",
legend.title = element_text(size = 10),
legend.text = element_text(size = 10),
legend.key.width = unit(x = 1, units = "in")
)
#% change for each species
data.species.percentage <- filter(data.species.year, (Year == 2008 | Year == 2019)) %>%
group_by(species) %>%
summarise(percent = 100*(number - first(number))/first(number)) %>%
filter(percent != 0) %>%
mutate(pos = percent >= 0) %>%
arrange(desc(percent))
## `summarise()` has grouped output by 'species'. You can override using the `.groups` argument.
#Bar plot of % change for each species
ggplot(data.species.percentage %>% mutate(species = factor(species, levels = data.species.percentage$species)), aes(x = species, y = percent, fill = pos)) + geom_bar(width = 0.8, position = position_dodge(width = 0.8), stat = "identity") + geom_text(data = data.species.percentage %>% filter(pos == FALSE), aes(label = round(percent, digits = 0)), color = 'red', size = 6, hjust=1.2, vjust=0.5) + geom_text(data = data.species.percentage %>% filter(pos == TRUE), aes(label = round(percent, digits = 0)), color = 'cyan', size = 6, hjust=-0.2, vjust=0.5) + xlab(label = "Species") + ylab(label = "Percentage change 2008-2019") + ggtitle("Percentage change from 2008 to 2019 by species") + coord_flip() +
theme(
plot.title = element_text(size = 15, face = "bold", hjust = 0.5),
axis.title = element_blank(),
axis.text = element_text(size = 12),
legend.position = "none"
)
From the above plots, we have a clear picture of what trends were for each of the species as well as their percentage change over the 12 year period. We see that almost all categories saw a decrease. Among these, dogs, hamsters, non-human primates and rabbits saw a steep decline.
I want to find what the numbers would be for the next 5 years if the same trend continues. I fit a linear model to the total number of animals vs. year graph and also made a table to reflect the figures.
m1 <- lm(Total ~ Year, data = data.total1)
summary(m1)
##
## Call:
## lm(formula = Total ~ Year, data = data.total1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -85230 -18560 -5143 25807 52073
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 54799007 6887537 7.956 0.0000124 ***
## Year -26705 3421 -7.807 0.0000146 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 40910 on 10 degrees of freedom
## Multiple R-squared: 0.859, Adjusted R-squared: 0.845
## F-statistic: 60.95 on 1 and 10 DF, p-value: 0.00001458
predict(m1, newdata = data.total1)
## 1 2 3 4 5 6 7 8
## 1176308.3 1149603.7 1122899.2 1096194.7 1069490.1 1042785.6 1016081.1 989376.5
## 9 10 11 12
## 962672.0 935967.5 909262.9 882558.4
data.total.1.predict <- data.frame(year = 2019:2024, Total = NA_integer_, round(predict(m1, newdata = tibble(Year = 2019:2024), interval = "confidence"), digits = 0))
data.total.1.predict
## year Total fit lwr upr
## 1 2019 NA 882558 833066 932051
## 2 2020 NA 855854 799759 911948
## 3 2021 NA 829149 766222 892077
## 4 2022 NA 802445 732521 872368
## 5 2023 NA 775740 698702 852779
## 6 2024 NA 749036 664793 833278
ggplot(data = data.total1, aes(x = Year, y = Total)) + geom_line() + geom_smooth(formula = "y~x", method = "lm") + geom_line(data = data.total.1.predict, mapping = aes(x = year, y = fit), linetype = "dashed", color = "red") + geom_ribbon(data = data.total.1.predict, mapping = aes(ymin = lwr, ymax = upr, x = year), fill = "red", alpha = 0.2) + scale_x_continuous(breaks = c(2008:2024)) + scale_y_continuous(breaks = seq(700000, 1300000, by = 100000), labels = seq(7, 13, 1)) + xlab(label = "Year") + ylab(label = "Total in hundred thousands") + ggtitle("Total number of animals vs. Year") + theme(
plot.title = element_text(size = 10, face = "bold", hjust = 0.5),
axis.title = element_text(size = 10),
axis.text = element_text(size = 10),
legend.position = "bottom",
legend.direction = "horizontal",
legend.title = element_text(size = 10),
legend.text = element_text(size = 10)
)
To summarize, I found that the number of animals used in biomedical research has seen a decreasing trend over the years 2008-2019, with a reduction of 24%. A point to note is that this is the data only for species covered under the Animal Welfare Act and hence it may not be representative of all the animals used. It is very likely that in reality the number of increased. Improvements in genetic engineering lead to easy creation of transgenic mice and rats in the 1980s which are estimated to be over 95% of the animals used in research today. This could have resulted in an ongoing decline in the usage of other mammals, as seen in the patterns for dogs and non-human primates for example.
California is the top state in terms of total numbers as well as across most pain categories. Among the top 10 states, 6 of them show a decreasing trend, including California.
The improvements in non-animal methods are also likely to have caused the decline in animal usage. oOr example, in the cosmetic industry, the “cruelty-free” tag gained traction in the public eye and led to an ongoing transformation in the beauty industry to shift away from performing toxicity testing on rabbits to using in vitro human skin models.
Through this project, I used many of the concepts I learned in BMIN 503. I was able to appreciate the work that goes into cleaning and transforming data to a usable form. I was able to make custom plots and GIS maps to reflect my data and build a simple linear model.